University of Maryland, College Park

VAST 2011 Challenge
Mini-Challenge 1 - Characterization of an Epidemic Spread
Authors and Affiliations:
Darya Filippova, University of Maryland, dfilippo@cs.umd.edu [PRIMARY contact]
Carl Kingsford, University of Maryland, carlk@umiacs.umd.edu [Faculty advisor]
Tool(s):
We used Python, the NLTK library (a collection of tools for natural language processing), numpy (scientific computing tools), and matplotlib (Python plotting), all of which are publicly available. Writing a script for parsing and filtering tweets requires some programming experience, but no specialized knowledge. Modifying the scripts to investigate a particular peculiarity in the data is straightforward.
Video:
ANSWERS:

MC 1.1 Origin and Epidemic Spread: Identify approximately where the outbreak started on the map (ground zero location). If possible, outline the affected area. Explain how you arrived at your conclusion.
Our analysis shows that the epidemic started on May 18th, likely near the Vastopolis Dome and Convention Center. We determined this by parsing and filtering the tweets with a Python + NLTK script. We kept only tweets containing at least one of the sickness-related words from the task description:
flu sick fever cough sweat ache pain fatigue nausea vomitting diarrhea lymph
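
The filtering step can be sketched roughly as follows. This is a minimal illustration, not our actual script: it uses a plain regular-expression tokenizer as a stand-in for NLTK, matches whole words only, and the names `SICK_WORDS`, `tokenize`, and `is_sick_tweet` as well as the sample tweets are made up for the example.

```python
import re

# Keywords copied from the list above (spelling kept exactly as listed).
SICK_WORDS = {
    "flu", "sick", "fever", "cough", "sweat", "ache", "pain",
    "fatigue", "nausea", "vomitting", "diarrhea", "lymph",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def is_sick_tweet(text):
    """Return True if any token exactly matches one of the keywords."""
    return any(token in SICK_WORDS for token in tokenize(text))


# Hypothetical sample tweets, for illustration only.
sample = [
    "Woke up with a fever and a bad cough",
    "Great game at the Dome last night!",
    "This FLU is the worst",
]
sick = [tweet for tweet in sample if is_sick_tweet(tweet)]
print(len(sick), "of", len(sample), "sample tweets mention sickness")
```

Exact word matching misses inflected forms such as "coughing"; a stemmer or prefix match would catch those at the cost of more false positives.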

We overlaid a heatmap of the remaining 42275 tweets on the Vastopolis map, which highlighted areas of Downtown in red. We then divided the tweets by day and looked at daily heatmaps. The daily analysis revealed a spike in the number of "sick" tweets on May 18th. Looking at hourly tweet counts for May 18th-20th, we observed that tweet counts increased significantly on May 18th at 8 a.m. (see Fig. 1). The heatmap for May 18th confirms that the disease started spreading on that day from the Downtown area (see Fig. 2). In particular, there was a significant number of tweets from around the Vastopolis Dome, the Convention Center, and Vastopolis City Hospital.

Figure 1. Hourly tweet counts on May 18th - May 20th

Figure 2. Heatmap of tweet counts for May 18th at 8a.m.
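
The hourly counts behind a plot like Fig. 1 can be obtained by truncating each tweet's timestamp to the hour and counting. The sketch below assumes the timestamps have already been parsed into `datetime` objects; the function name `hourly_counts` and the sample timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime


def hourly_counts(timestamps):
    """Count tweets per hour by truncating each timestamp to the hour."""
    return Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )


# Hypothetical timestamps of "sick" tweets, for illustration only.
stamps = [
    datetime(2011, 5, 18, 7, 45),
    datetime(2011, 5, 18, 8, 5),
    datetime(2011, 5, 18, 8, 30),
    datetime(2011, 5, 18, 8, 59),
    datetime(2011, 5, 19, 8, 10),
]
counts = hourly_counts(stamps)
for hour in sorted(counts):
    print(hour.strftime("%Y-%m-%d %H:00"), counts[hour])
```

A sudden jump between consecutive hourly buckets is what marks the onset in this kind of plot.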


MC 1.2 Epidemic Spread: Present a hypothesis on how the infection is being transmitted. For example, is the method of transmission person-to-person, airborne, waterborne, or something else? Identify the trends that support your hypothesis. Is the outbreak contained? Is it necessary for emergency management personnel to deploy treatment resources outside the affected area? Explain your reasoning.
While a heatmap is good for identifying the locations with the highest tweet counts, such hotspots may overshadow medium-count locations that are equally important. To give less weight to the highest counts and highlight locations with lower tweet counts, we plotted log(c + 1), where c is the tweet count in a given cell of the heatmap grid. Figure 3 shows that, apart from Downtown, part of Eastside and the banks of the Vast river are affected. Hospitals in each of the boroughs stand out as well.

Figure 3. Heatmap of a log of tweet counts for May 18th at 8a.m.
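
The log(c + 1) transform can be illustrated with a small, dependency-free sketch. It bins points into a regular grid and applies `math.log1p` to each cell count. The grid size, coordinate ranges, and sample points are assumptions chosen for the example, and the helper names `grid_counts` and `log_scale` are hypothetical.

```python
import math
from collections import Counter


def grid_counts(points, x_range, y_range, nx, ny):
    """Count points per cell of an nx-by-ny grid covering the given ranges."""
    (x0, x1), (y0, y1) = x_range, y_range
    counts = Counter()
    for x, y in points:
        if not (x0 <= x <= x1 and y0 <= y <= y1):
            continue  # skip points that fall outside the map
        # Clamp so that points on the upper edge land in the last cell.
        col = min(int((x - x0) / (x1 - x0) * nx), nx - 1)
        row = min(int((y - y0) / (y1 - y0) * ny), ny - 1)
        counts[(row, col)] += 1
    return counts


def log_scale(counts):
    """Apply log(c + 1) to every non-empty cell."""
    return {cell: math.log1p(c) for cell, c in counts.items()}


# Hypothetical points: one hotspot with 100 tweets, one cell with only 10.
points = [(0.1, 0.1)] * 100 + [(0.6, 0.6)] * 10
counts = grid_counts(points, (0.0, 1.0), (0.0, 1.0), nx=2, ny=2)
logged = log_scale(counts)
for cell in sorted(counts):
    print(cell, counts[cell], round(logged[cell], 2))
```

In raw counts the hotspot is ten times larger than the second cell; after the transform the ratio is below two, so the medium-count cell remains visible on the same color scale.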

Plotting the log heatmaps for each day shows how the distribution of "sick" tweets changes over time (Fig. 4 shows the last three days of the data, when the disease really takes off). On May 18th, the disease started in the Downtown area and spread to the Eastside, possibly carried by the strong wind from the West (as indicated in Weather.csv). Sick tweets started dispersing through the boroughs in the evening after 5 p.m. (the evening commute from the center to the suburbs) and were published at a steady rate throughout the night (see Fig. 1 above). On May 19th, we observed further disease spread in the westward direction due to the strong wind. In addition, there was a significant number of "sick" tweets along the banks of the Vast river downstream of the infected Downtown area. (We determined the direction of the river from the discharge at one of the dams.)

Panels: May 18th, May 19th, May 20th

Figure 4. Daily logarithmic heatmaps of tweet counts related to sickness.

These two observations led us to hypothesize that the disease may spread by air as well as by water. While the evidence for airborne transmission is strong, we were puzzled to see no increase in "sick" tweets around the other bodies of water in the city (lakes Pasta, Bread, Rice, Disco, and Twin Lakes). Still, it is better to be safe than sorry, so we treat both routes of transmission as likely.

River contamination may propagate disease further downstream and infect other cities, so our advice to emergency management personnel is to treat the river water and to stop the water intake from the river downstream of Vastopolis.

          healthy    sick    % sick
before      73929    4966       6.3
after       67209   21678      24.4

Table 1. Tweet counts mentioning sickness vs. all other tweets before and after May 18th

Table 1 shows the number of tweets that contained references to sickness, as well as the number of "healthy" tweets, before and after May 18th. The significant increase in "sick" tweets (24.4% of all tweets in the last 3 days) indicates that this new disease started an epidemic and that its elimination may require drastic measures. Since its spontaneous onset on May 18th, the disease has spread fast and infected a quarter of the population. All of this indicates that the disease is highly contagious and that the outbreak can easily spread outside the city.
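
The percentages in Table 1 can be checked with a few lines of arithmetic. The sketch below assumes each row of the table reads as healthy count, sick count, and percent sick, and that the percentage is sick / (healthy + sick); the function name `percent_sick` is made up for the example.

```python
def percent_sick(healthy, sick):
    """Share of tweets mentioning sickness, as a percentage of all tweets."""
    return 100.0 * sick / (healthy + sick)


# Counts as read from Table 1 (healthy, sick) for each period.
before = percent_sick(73929, 4966)
after = percent_sick(67209, 21678)
print(f"before: {before:.1f}%  after: {after:.1f}%")
```

Rounded to one decimal place, this reproduces the 6.3% and 24.4% reported in the table.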

The symptoms and the method of transmission resemble those of the West Nile virus, which is spread by mosquitoes; mosquitoes can be carried by the wind and are also more prevalent near water.